



Path following algorithms for $\ell_2$-regularized $M$-estimation with approximation guarantee

Neural Information Processing Systems

Many modern machine learning algorithms are formulated as regularized M-estimation problems, in which a regularization (tuning) parameter controls a trade-off between model fit to the training data and model complexity. To select the ``best'' tuning parameter value that achieves a good trade-off, an approximated solution path needs to be computed. In practice, this is often done by selecting a grid of tuning parameter values and solving the regularized problem at the selected grid points. However, given any desired level of accuracy, it is often not clear how to choose the grid points and how accurately one should solve the regularized problems at the selected grid points, both of which can greatly impact the overall amount of computation. In the context of the $\ell_2$-regularized $M$-estimation problem, we propose a novel grid point selection scheme and an adaptive stopping criterion for any given optimization algorithm that together produce an approximated solution path with an approximation error guarantee. Theoretically, we prove that the proposed solution path can approximate the exact solution path to an arbitrary level of accuracy, while keeping the overall computation as low as possible. Numerical results also corroborate our theoretical analysis.
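A minimal sketch of the general idea, assuming a logistic loss, plain gradient descent, a geometric grid of regularization values, and a gradient-norm stopping rule; the grid selection and stopping criterion below are generic stand-ins for illustration, not the scheme proposed in the paper:

```python
import numpy as np

# Warm-started path following for  min_w  f(w) + (lam/2)||w||^2.
# For a lam-strongly-convex objective F, F(w) - F* <= ||grad F(w)||^2 / (2*lam),
# which gives a simple adaptive stopping rule at each grid point.

def logistic_loss_grad(w, X, y):
    """Gradient of f(w) = mean_i log(1 + exp(-y_i * x_i^T w))."""
    z = y * (X @ w)
    return X.T @ (-y / (1.0 + np.exp(z))) / X.shape[0]

def approx_path(X, y, lam_max=1.0, lam_min=1e-3, ratio=0.7, eps=1e-4, step=0.1):
    """Return {lam: w_lam}, each solved to suboptimality <= eps, warm-starting downwards."""
    w = np.zeros(X.shape[1])
    path, lam = {}, lam_max
    while lam >= lam_min:
        while True:
            g = logistic_loss_grad(w, X, y) + lam * w
            if g @ g <= 2.0 * lam * eps:          # adaptive stopping criterion
                break
            w = w - step * g                      # plain gradient step (illustrative)
        path[lam] = w.copy()                      # warm start for the next grid point
        lam *= ratio                              # geometric grid (assumed for simplicity)
    return path

# Toy usage
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = np.sign(X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(200))
path = approx_path(X, y)
```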


Connecting Optimization and Regularization Paths

Neural Information Processing Systems

Consequently, a line of work has focused on characterizing the implicit biases of the optima reached by various optimization algorithms. For example, Gunasekar et al. [2017] consider the problem of matrix factorization and show that gradient descent (GD) on the unregularized objective converges to the minimum nuclear norm solution.
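A toy numerical illustration of that claim, not the construction from Gunasekar et al. [2017]: gradient descent on an unregularized factorized matrix completion objective, started from a small initialization, after which the nuclear norm of the recovered matrix is compared with that of the low-rank ground truth. Problem size, mask density, and step size are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 20, 2
W_star = rng.standard_normal((n, r))
M_star = W_star @ W_star.T                      # rank-2 PSD ground truth
obs = np.triu(rng.random((n, n)) < 0.5)
mask = (obs | obs.T).astype(float)              # symmetric observation mask
n_obs = mask.sum()

U = 1e-3 * rng.standard_normal((n, n))          # small, full-dimensional initialization
lr = 0.05
for _ in range(8000):
    R = mask * (U @ U.T - M_star) / n_obs       # residual on observed entries only
    U -= lr * 2.0 * ((R + R.T) @ U)             # gradient of the un-regularized loss

print("nuclear norm of GD solution :", np.linalg.norm(U @ U.T, "nuc"))
print("nuclear norm of ground truth:", np.linalg.norm(M_star, "nuc"))
```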


Export Reviews, Discussions, Author Feedback and Meta-Reviews

Neural Information Processing Systems

First provide a summary of the paper, and then address the following criteria: quality, clarity, originality and significance. The paper examines the problem of approximating kernel functions by random features. The main result is that using an L1 regularisation one needs only O(1/\epsilon) random features to obtain an \epsilon-accurate approximation to kernel functions. The paper develops a Sparse Random Features algorithm which is analogous to functional gradient descent in boosting. The algorithm requires O(1/\epsilon) random features, which compares extremely favourably with the state of the art, which requires O(1/\epsilon^2) features.
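A simplified stand-in for the idea described in the review, assuming an RBF kernel and a squared loss: sample random Fourier features, then fit the weights with an l1 penalty (via iterative soft thresholding) so that only a sparse subset of the sampled features is retained. The actual Sparse Random Features algorithm adds features greedily in the spirit of functional gradient descent; this sketch only shows the l1-regularized fit:

```python
import numpy as np

def random_fourier_features(X, D, gamma=1.0, rng=None):
    """Random Fourier features approximating the RBF kernel exp(-gamma ||x - x'||^2)."""
    if rng is None:
        rng = np.random.default_rng(0)
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(X.shape[1], D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

def lasso_ista(Phi, y, lam, iters=2000):
    """Minimize (1/2n)||Phi w - y||^2 + lam ||w||_1 by iterative soft thresholding."""
    n, D = Phi.shape
    step = n / np.linalg.norm(Phi, 2) ** 2        # 1/L for the smooth part
    w = np.zeros(D)
    for _ in range(iters):
        z = w - step * (Phi.T @ (Phi @ w - y) / n)
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft threshold
    return w

# Toy usage: only a fraction of the sampled features end up with nonzero weight.
rng = np.random.default_rng(3)
X = rng.uniform(-3.0, 3.0, size=(300, 1))
y = np.sin(2.0 * X[:, 0]) + 0.1 * rng.standard_normal(300)
Phi = random_fourier_features(X, D=500, gamma=0.5, rng=rng)
lam = 0.1 * np.abs(Phi.T @ y).max() / len(y)      # fraction of the largest correlation
w = lasso_ista(Phi, y, lam)
print("nonzero features:", int((np.abs(w) > 1e-8).sum()), "of", Phi.shape[1])
```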



Reviewer 1: The regularized version of the FR problem is a geodesically convex optimization problem over the feasible

Neural Information Processing Systems

We would like to thank all referees for their appreciation of our results and the useful feedback. The KL divergence (confined to the subspace of Gaussian distributions) is not induced by any Riemannian metric. We propose to elaborate on these connections in the introduction. As we pointed out in our response to Rev. 1, solving the KL problem (12) using Theorem 3.2 takes [...]. Because (12) is non-convex, the gradient descent algorithm is not guaranteed to converge to the global minimum of (12). Thank you also for your minor suggestions, which we plan to address in the revised version of the manuscript.



Accelerated Training for Matrix-norm Regularization: A Boosting Approach

Neural Information Processing Systems

Sparse learning models typically combine a smooth loss with a nonsmooth penalty, such as the trace norm. Although recent developments in sparse approximation have offered promising solution methods, current approaches either apply only to matrix-norm constrained problems or provide suboptimal convergence rates. In this paper, we propose a boosting method for regularized learning that guarantees ɛ accuracy within O(1/ɛ) iterations. Performance is further accelerated by interlacing boosting with fixed-rank local optimization, exploiting a simpler local objective than previous work. The proposed method yields state-of-the-art performance on large-scale problems. We also demonstrate an application to latent multiview learning for which we provide the first efficient weak oracle.
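A generic sketch of the rank-one weak-oracle idea for the trace-norm setting, written here as a Frank-Wolfe-style update for a trace-norm-constrained matrix completion problem rather than the paper's boosting method; the 2/(t+2) schedule is the standard choice that yields ɛ accuracy after O(1/ɛ) iterations:

```python
import numpy as np

def trace_norm_boost(M, mask, tau, iters=200):
    """min_{||X||_* <= tau} 0.5 ||mask * (X - M)||_F^2 via rank-one (weak-oracle) updates."""
    X = np.zeros_like(M)
    for t in range(iters):
        G = mask * (X - M)                       # gradient of the smooth loss
        U, s, Vt = np.linalg.svd(-G)             # weak oracle: top singular pair of -G
        S = tau * np.outer(U[:, 0], Vt[0, :])    # extreme point of the trace-norm ball
        gamma = 2.0 / (t + 2.0)                  # standard step-size schedule
        X = (1.0 - gamma) * X + gamma * S
    return X

# Toy usage: complete a rank-3 matrix from roughly half of its entries.
rng = np.random.default_rng(4)
M = rng.standard_normal((30, 3)) @ rng.standard_normal((3, 30))
mask = (rng.random(M.shape) < 0.5).astype(float)
X = trace_norm_boost(M, mask, tau=np.linalg.norm(M, "nuc"))
print("relative error on unobserved entries:",
      np.linalg.norm((1.0 - mask) * (X - M)) / np.linalg.norm((1.0 - mask) * M))
```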


Sparsity via Sparse Group $k$-max Regularization

Tao, Qinghua, Xi, Xiangming, Xu, Jun, Suykens, Johan A. K.

arXiv.org Machine Learning

For the linear inverse problem with sparsity constraints, the $l_0$ regularized problem is NP-hard, and existing approaches either utilize greedy algorithms to find almost-optimal solutions or approximate the $l_0$ regularization with its convex counterparts. In this paper, we propose a novel and concise regularization, namely the sparse group $k$-max regularization, which not only simultaneously enhances group-wise and in-group sparsity, but also places no additional restraints on the magnitude of variables within each group, which is especially important for variables at different scales, so that it approximates the $l_0$ norm more closely. We also establish an iterative soft thresholding algorithm, with local optimality conditions and a complexity analysis provided. Through numerical experiments on both synthetic and real-world datasets, we verify the effectiveness and flexibility of the proposed method.
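A hedged sketch of a group-wise iterative soft-thresholding scheme for the linear inverse problem; for simplicity it uses the standard sparse-group-lasso proximal step (entry-wise plus group-wise shrinkage) rather than the sparse group $k$-max penalty, whose proximal operator is what the paper develops:

```python
import numpy as np

def sparse_group_ista(A, b, groups, lam1=0.5, lam2=0.5, iters=1000):
    """Minimize 0.5||Ax - b||^2 + lam1 ||x||_1 + lam2 * sum_g ||x_g||_2 by ISTA."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2            # 1/L for the quadratic term
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        z = x - step * (A.T @ (A @ x - b))            # gradient step on the smooth part
        z = np.sign(z) * np.maximum(np.abs(z) - step * lam1, 0.0)   # entry-wise shrink
        for g in groups:                              # group-wise shrink
            norm_g = np.linalg.norm(z[g])
            if norm_g > 0.0:
                z[g] *= max(0.0, 1.0 - step * lam2 / norm_g)
        x = z
    return x

# Toy usage: the true signal lives in one of four groups of ten variables.
rng = np.random.default_rng(5)
A = rng.standard_normal((60, 40))
x_true = np.zeros(40)
x_true[10:15] = rng.standard_normal(5)
b = A @ x_true + 0.01 * rng.standard_normal(60)
groups = [np.arange(i, i + 10) for i in range(0, 40, 10)]
x_hat = sparse_group_ista(A, b, groups)
print("active groups:", [int(np.linalg.norm(x_hat[g]) > 1e-3) for g in groups])
```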